iT邦幫忙, 11th iThome Ironman Contest (第 11 屆 iThome 鐵人賽), AI & Data

Hands on Data Cleaning and Scraping series, Day 15

Day15 Numerical Data 1/2: Replacing N/As and Outliers

In the Day04 article we covered several values commonly used to fill in N/As or replace outliers. Today we put them into practice, using the training data from the Kaggle competition Titanic: Machine Learning from Disaster.

import pandas as pd
import numpy as np
import copy

df = pd.read_csv('data/train.csv')  # read in the file
df.head()  # show the first five rows

https://ithelp.ithome.com.tw/upload/images/20190915/20119709ADhYeAtB0X.jpg

# keep only the int64 and float64 columns and store their names in num_features
num_features = []
for dtype, feature in zip(df.dtypes, df.columns):
    if dtype == 'float64' or dtype == 'int64':
        num_features.append(feature)
print(f'{len(num_features)} Numeric Features : {num_features}')

https://ithelp.ithome.com.tw/upload/images/20190915/201197093J7kZOvxI3.jpg
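
The dtype loop above could also be written with pandas' `select_dtypes`. The sketch below is only an illustration: it uses a small stand-in DataFrame named `toy` with made-up values (an assumption, so that it runs without `train.csv`), and `include='number'` keeps every numeric width, not just int64 and float64.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, not the real Titanic file
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Name': ['A', 'B', 'C'],
    'Age': [22.0, np.nan, 26.0],
    'Fare': [7.25, 71.28, 7.92],
})

# select_dtypes returns only the columns whose dtype is numeric
num_features = toy.select_dtypes(include='number').columns.tolist()
print(f'{len(num_features)} Numeric Features : {num_features}')
```

On the real data the same call would be applied to `df` instead of `toy`.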

# drop the text columns and keep only the numeric ones
df = df[num_features]
df.head()

https://ithelp.ithome.com.tw/upload/images/20190915/20119709pFSmU3CLJI.jpg

# count the N/As in each column, largest first
df.isnull().sum().sort_values(ascending=False)

https://ithelp.ithome.com.tw/upload/images/20190915/20119709Xd1pr0IrCt.jpg
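
Alongside the raw counts, the share of missing rows per column can help when deciding whether a column is worth filling at all. A minimal sketch, assuming a small made-up DataFrame `toy` in place of the Titanic data; the names `na_count`, `na_ratio` and `summary` are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, not the real Titanic file
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, np.nan, 8.05],
    'Pclass': [3, 1, 3, 3],
})

na_count = toy.isnull().sum()   # number of N/As per column
na_ratio = toy.isnull().mean()  # fraction of rows that are N/A per column

# Put both in one table, most incomplete column first
summary = pd.DataFrame({'na_count': na_count, 'na_ratio': na_ratio})
summary = summary.sort_values('na_count', ascending=False)
print(summary)
```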

Fill N/As with the mean

df_mn = df.fillna(df.mean())
df_mn['Age']

https://ithelp.ithome.com.tw/upload/images/20190915/20119709T7UpN0BEcd.jpg
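
To see what `fillna(df.mean())` does column by column, here is a small sketch on made-up numbers (the DataFrame `toy` is an assumption, not part of the Titanic data). Each N/A is replaced by the mean of its own column, and the original frame is left unchanged because `fillna` returns a new DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with one N/A per column
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Fare': [10.0, 20.0, np.nan],
})

# toy.mean() is a Series indexed by column name, so each column
# gets its own mean as the fill value (N/As are skipped when averaging)
toy_mn = toy.fillna(toy.mean())
print(toy_mn)
```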

Fill N/As with the median

df_md = df.fillna(df.median())
df_md['Age']

https://ithelp.ithome.com.tw/upload/images/20190915/201197092CuB9E9wVk.jpg
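
The choice between the two fill values matters most when a column contains extreme values. The sketch below uses a made-up `Fare` column with one very large value (an assumption for illustration only) to show how far apart the mean and the median can be.

```python
import numpy as np
import pandas as pd

# Hypothetical fares: three small values, one N/A, one extreme value
toy = pd.DataFrame({'Fare': [7.0, 8.0, 9.0, np.nan, 500.0]})

mean_fill = toy['Fare'].mean()      # pulled upward by the extreme value
median_fill = toy['Fare'].median()  # stays close to the typical fare

fare_mn = toy['Fare'].fillna(mean_fill)
fare_md = toy['Fare'].fillna(median_fill)
print(f'mean fill: {mean_fill}, median fill: {median_fill}')
```

In a case like this the median is usually the safer fill value, since a single outlier does not shift it much.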

The code for this article is available on GitHub.

Please let me know if there are any mistakes in this article. Thanks for reading.

References:

[1] Content from the 2nd 100-Day Machine Learning Marathon (第二屆機器學習百日馬拉松)

[2] Titanic: Machine Learning from Disaster

